EXPLORATORY DATA ANALYSIS – RED WINE QUALITY by Wei Tang
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00
##  Median :0.07900   Median :14.00       Median : 38.00
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00
##     density             pH          sulphates         alcohol
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90
##     quality
##  Min.   :3.000
##  1st Qu.:5.000
##  Median :6.000
##  Mean   :5.636
##  3rd Qu.:6.000
##  Max.   :8.000
Our dataset consists of 12 variables and 1,599 observations. The quality variable is discrete; the others are continuous.
Red wine quality is roughly normally distributed and concentrated around the average ratings of 5 and 6.
The distribution of fixed acidity is right skewed and concentrated around 8.
Cutting the outliers.
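The report does not say how the outliers were cut. One common approach, shown here purely as an assumption, is to drop values above a high percentile before plotting. The vector below is a small illustrative sample (only its minimum and maximum match the summary above), and the 99th-percentile cutoff is a hypothetical choice.

```r
# Illustrative fixed acidity values; not the real data.
fixed.acidity <- c(4.6, 7.1, 7.4, 7.9, 8.3, 9.2, 10.1, 15.9)

# Assumed cutoff: the 99th percentile of the variable.
upper <- quantile(fixed.acidity, 0.99)

# Keep only the observations at or below the cutoff.
trimmed <- fixed.acidity[fixed.acidity <= upper]

print(upper)
print(trimmed)
```

Trimming like this only changes what the plot shows; the summary statistics reported earlier still describe the full data.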
The distribution of citric acid is not normal.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The distribution of chlorides is right skewed and concentrated around 0.08. There are a few outliers on this plot.
The distribution of density is approximately normal and concentrated around 0.9967.
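The skew descriptions above can be cross-checked against the summary table: a mean noticeably above the median hints at right skew, while a mean close to the median is consistent with a symmetric shape. The sketch below uses the median and mean values copied from the summary output in this report. The column name `relative.gap` is made up for illustration, and this is a rule of thumb rather than a formal test.

```r
# Median and mean copied from the summary() output earlier in this report.
stats <- data.frame(
  variable = c("fixed.acidity", "residual.sugar", "chlorides", "density"),
  median   = c(7.90, 2.200, 0.07900, 0.9968),
  mean     = c(8.32, 2.539, 0.08747, 0.9967)
)

# Gap between mean and median, relative to the median, as a rough skew hint.
# Clearly positive suggests right skew; near zero suggests symmetry.
stats$relative.gap <- (stats$mean - stats$median) / stats$median

print(stats)
```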
We divide the data into two groups: the high quality group contains observations whose quality is 7 or 8, and the low quality group contains observations whose quality is 3 or 4. After examining the difference in each feature between the two groups, we see that volatile acidity, density, and citric acid may have some correlation with quality. Let’s visualize the data to see the difference.
The lower the volatile acidity, the better the quality.
Density seems to have no correlation with quality.
The higher the citric acid, the better the quality.
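The high/low group comparison described above can be sketched as follows. The data frame here is a tiny made-up stand-in for the wine data (the values are illustrative only), and the names `extremes`, `group`, and `group.means` are hypothetical.

```r
# Toy data standing in for the wine data frame; values are illustrative only.
wine <- data.frame(
  quality          = c(3, 4, 4, 5, 6, 6, 7, 7, 8),
  volatile.acidity = c(0.90, 0.70, 0.80, 0.60, 0.50, 0.55, 0.40, 0.35, 0.30),
  citric.acid      = c(0.05, 0.10, 0.15, 0.25, 0.27, 0.30, 0.40, 0.38, 0.45),
  density          = c(0.9970, 0.9965, 0.9968, 0.9967, 0.9966, 0.9969,
                       0.9962, 0.9964, 0.9960)
)

# Keep only the extremes: quality 3-4 is "low", quality 7-8 is "high".
extremes <- subset(wine, quality <= 4 | quality >= 7)
extremes$group <- ifelse(extremes$quality >= 7, "high", "low")

# Mean of each feature within each group.
group.means <- aggregate(
  cbind(volatile.acidity, citric.acid, density) ~ group,
  data = extremes,
  FUN  = mean
)

print(group.means)
```

Comparing group means is only a first look; the plots show whether the distributions overlap.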
There are 1,599 red wines in the dataset, with 11 features on the chemical properties of the wine (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, and alcohol) plus the quality rating.
I would like to know which factors determine, or are correlated with, the quality of the wine.
Volatile acidity, citric acid, and alcohol seem to contribute to the quality of a wine. However, density seems to have no relationship with the quality of the wine.
I think volatile acidity, citric acid, and alcohol probably contribute most to the quality.
Yes, I created a new variable, quality.level, which divides quality into “low”, “average”, and “high”.
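A variable like quality.level can be built with `cut()`. The report does not state the cut points, so the sketch below assumes they match the groups used earlier: 3 and 4 are low, 7 and 8 are high, which leaves 5 and 6 as average. The `quality` vector is a small illustrative sample.

```r
# Illustrative quality scores; not the real data.
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# Assumed bins: (2,4] = low, (4,6] = average, (6,8] = high.
# cut() uses right-closed intervals by default.
quality.level <- cut(
  quality,
  breaks = c(2, 4, 6, 8),
  labels = c("low", "average", "high")
)

print(table(quality.level))
```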
Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this? This dataset is pretty clean, so I did not need to do any cleaning.
The graph shows a very clear trend; the lower volatile acidity is, the higher the quality becomes.
The graph shows there is no positive relationship between quality level and citric acid.
With a correlation coefficient of 0.476, the graph shows a positive relationship between alcohol and quality level.
A weak negative correlation of -0.2 exists between percent alcohol content and volatile acidity.
The correlation coefficient is 0.04, which indicates that there is almost no relationship between residual sugar and percent alcohol content. However, most wines are concentrated in the low-sugar area.
There is a negative correlation between citric acid and volatile acidity.
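Correlation coefficients like the ones quoted above can be computed with `cor()`, and `cor.test()` adds a significance test. The sketch below assumes Pearson correlation (the default) and uses a small made-up data frame, so its numbers will not match the figures in this report.

```r
# Toy data standing in for the wine data frame; values are illustrative only.
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 10.2, 11.0, 12.5, 9.5, 10.8, 11.6),
  volatile.acidity = c(0.70, 0.66, 0.58, 0.45, 0.36, 0.62, 0.60, 0.40),
  residual.sugar   = c(1.9, 2.6, 2.3, 1.9, 2.2, 2.0, 2.5, 2.1)
)

# Pearson correlation between two variables.
r.acidity <- cor(wine$alcohol, wine$volatile.acidity)

# Same coefficient with a confidence interval and p-value.
test <- cor.test(wine$alcohol, wine$volatile.acidity)

# Full correlation matrix, rounded for readability.
print(round(cor(wine), 2))
print(test$estimate)
```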
I found a negative relationship between quality level and volatile acidity, and a positive correlation between quality level and alcohol.
The correlation coefficient between sugar and alcohol is 0.04, which indicates that there is almost no relationship between residual sugar and percent alcohol content. However, most wines are concentrated in the low-sugar area.
The clearest relationships were the negative one between quality level and volatile acidity and the positive one between quality level and alcohol.
High quality wines are concentrated at densities between 0.994 and 0.998 and at the lower end of volatile acidity.
Wines with alcohol between 10 and 13 and volatile acidity between 0.2 and 0.5 tend to be high quality.
This plot reveals that the majority of wines are rated 5 or 6. No wine is rated 1, 2, 9, or 10. In other words, there are no extremely bad wines and no extremely good wines.
The graph shows a very clear trend; the lower volatile acidity is, the higher the quality becomes.
The wines data set contains data on 1,599 wines and 12 variables from around 2009. I started by asking some questions, then explored the dataset and created several plots to answer them. First, I made assumptions about which factors might affect the quality of the wine. For example, pH was negatively correlated with volatile acidity, which makes sense.
I created a linear model to attempt to predict red wine quality. It was accurate for average wines but very inaccurate for bad and excellent wines: it overpredicted the bad wines and underpredicted the good ones.
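The report does not show the model itself, so the following is only a sketch of how such a linear model and its errors by quality score could be examined. The predictors (alcohol and volatile acidity) are an assumption, the data frame is a small made-up stand-in, and `mean.residual` is a hypothetical name.

```r
# Toy data standing in for the wine data frame; values are illustrative only.
wine <- data.frame(
  quality          = c(3, 4, 5, 5, 5, 6, 6, 6, 7, 8),
  alcohol          = c(9.0, 9.8, 9.4, 9.6, 10.0, 10.5, 10.2, 11.0, 11.5, 12.0),
  volatile.acidity = c(0.90, 0.75, 0.60, 0.65, 0.58, 0.50, 0.55, 0.45, 0.40, 0.35)
)

# Ordinary least squares; the choice of predictors is an assumption.
fit <- lm(quality ~ alcohol + volatile.acidity, data = wine)

# Residual = actual - predicted. Negative means the model overpredicted.
wine$residual <- residuals(fit)

# Average residual at each quality score.
mean.residual <- tapply(wine$residual, wine$quality, mean)

print(coef(fit))
print(round(mean.residual, 2))
```

In a least-squares fit with an intercept, residuals are positively correlated with the actual values whenever the fit is imperfect, so predictions pulled toward the average ratings are expected behaviour rather than a bug.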
Alcohol appeared to be the number one factor for determining an excellent wine. However, citric acid and sulphates had to be in specific amounts in order for alcohol to take over. Volatile acidity made a wine bad in large amounts.
The most difficult part for me was spending a lot of time playing around with the 12 variables to decide which facts were interesting. I also struggled with choosing the most appropriate graph for each plot.